ABSTRACT

Various aspects of Confidence Weighting are examined. Variants of
Confidence Weighting, its effect on test reliability, and the validity
of Confidence Weighting are discussed. (DG)

Inherent Sources of Unreliability in Objective Achievement Tests

Under conventional objective achievement testing procedure, the fact
that a student has selected the "correct" response symbol for a given item
says little about how much he actually knows about that item. All correct
response symbols look alike, no matter how or why they were selected. One
student might have been able to supply the correct response, without
hesitation, to an open-ended question on the point involved. Another might
not have been able to supply such a response but did recognize it at once
when it was supplied. Still another might have just barely preferred the
correct response over an incorrect alternative. Finally, another student may
have selected this correct response quite by chance in a desperate flurry of
random response selections during the final few seconds of the test. Thus,
under conventional objective achievement testing procedure, response
selections based on grossly disparate levels of relevant knowledge can
receive the same score credit. A fortuitous "guess" receives full credit while
relevant knowledge far beyond the minimum level required to divine the
correct response cannot be manifested and, so, receives no extra credit.

The possibility of guessing and the necessity for dichotomous scoring
are both inherent in conventional objective testing procedure. Guessing
operates to inflate scores randomly at the lowest ability levels while
dichotomous scoring operates to truncate scores systematically at the highest
ability levels. The resultant effects of these two factors are to reduce the
range of scores and to introduce a random variable: chance. Both of these
effects reduce test reliability. All of this was fully recognized from the
beginning of objective testing but the potential utility of this test format
inspired a search for procedural strategies to meliorate these inherent faults.

Potential Solutions 

Corrections for Guessing. The most obvious source of unreliability was
guessing, and two so-called correction-for-guessing strategies were developed.
The well-known subtractive correction was intended to make guessing
unprofitable; the less common additive correction was intended to make it
unnecessary. Of course, it is impossible to "correct" for a random variable.
If these two strategies have any effect on test reliability, it is by
inhibiting guessing on speeded tests and, even here, the effect will vary
from testee to testee. However, most achievement tests are power tests and,
as Gulliksen (1950) has pointed out, if every testee attempts every item,
corrections for guessing have no effect on test reliability. In brief, these
strategies are not the answer to all objective testing problems and may not
be the answer to any.
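
The two corrections can be sketched in Python as follows. The text names them but does not give formulas, so the forms used here (rights minus wrongs/(k - 1), and rights plus omits/k, for k-choice items) are the usual textbook versions and are an assumption, as are the function names and the sample scores. The last lines illustrate the point attributed to Gulliksen: when nobody omits an item, the corrected score is a straight-line function of the rights-only score, so the ordering of testees is untouched.

```python
def subtractive_correction(rights, wrongs, n_choices):
    """Rights minus wrongs/(k - 1): the usual subtractive formula (assumed)."""
    return rights - wrongs / (n_choices - 1)


def additive_correction(rights, omits, n_choices):
    """Rights plus omits/k: credits each omission at its chance value (assumed)."""
    return rights + omits / n_choices


# Hypothetical 40-item, 4-choice power test on which nobody omits an item.
n_items, k = 40, 4
rights_scores = [12, 20, 28, 36]
corrected = [subtractive_correction(r, n_items - r, k) for r in rights_scores]

# With no omissions, corrected = r * k/(k - 1) - n/(k - 1): linear in r.
linear = [r * k / (k - 1) - n_items / (k - 1) for r in rights_scores]
print(corrected)
print(linear)
```

Because a linear rescaling changes neither rank order nor correlations, a correlation-based reliability coefficient computed on the corrected scores equals the one computed on the rights-only scores in this no-omission case.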

Confidence Weighting. Ideally, an achievement test should permit the
respondent to manifest all of the knowledge he has relevant to each item
in the sample of items comprising the test. A dichotomously scored test
merely counts the number of times he had "enough" knowledge. For every
item on which he had more than "enough" knowledge, such scoring truncates
the continuous underlying variable we are trying to measure. Confidence
Weighting (CW) was designed to permit the testee to manifest his "extra"
knowledge.

Confidence Weighting

Definition. CW is a special procedure for responding to objective test
items, and for scoring such responses, wherein the respondent who is willing
to indicate high confidence in his response selection is awarded a specified
extra point credit if, indeed, he is right but incurs a specified point
penalty if, in fact, he is wrong. This option is exercised independently on
each item. This procedure can be applied to any so-called objective-type
item: true-false, multiple-choice, matching, or objectively scorable
completion items. However, the empirical studies on CW reported in the
literature have been confined to multiple-choice or true-false tests, with
the latter somewhat more common.

Variants of CW. Oddly enough, the earliest studies on CW, dating from
the mid-30's, employed the most elaborate procedures. In studies reported
by Soderquist (1936) and Swineford (1938, 1941) involving both true-false
and multiple-choice tests, respondents had the option, on each item, of
indicating any one of four levels of confidence in their response selection.
Each level of confidence carried a different pair of score contingencies.
The lowest level of confidence would yield 1 point if right, 0 if wrong;
the next higher level would yield 2 or -4; the next level 3 or -6; and the
highest level, 4 or -8. Different response symbols served to indicate the
level of confidence the respondent wished to express in his response
selection. The score contingencies specified for the lowest level of
confidence, 1 or 0, will be recognized as those of conventional rights-only
scoring. Dressel and Schmid (1953) offered the following pairs of score
contingencies on a multiple-choice test: 1 or -1, 2 or -2, 3 or -3, and
4 or -4. Jacobs
(1968) compared the effects of two different bonus-penalty ratios, offering
one group 1 or 0, 2 or -2, and 3 or -3, and the other group 1 or 0, 2 or -4,
and 3 or -6. He found no significant differences between the risk-taking
behavior patterns of these two groups. Much of the recent research done on CW
has involved only two levels of confidence: none and some. In a series of
studies using CW with true-false tests, Ebel (1965) offered contingencies of
1 or -1 and 2 or -2. It will be recognized that contingencies of 1 or -1 on
a true-false test amount to a conventional subtractive correction for
guessing. In addition, Ebel awarded .5 for each omission, which amounts to an
additive correction for guessing. Thus, his procedure combined the features
of CW and both forms of correction for guessing. Garvin (1969) reported an
extensive study involving multiple-choice tests in which the only
contingencies offered were 1 or 0 and 2 or -2. On certain of the tests
involved, a quota was set such that a respondent could elect the "confident"
option on no more than half the items. Of course, no minimum quota was ever
set. Other patterns of contingencies and other special instructions have been
employed in research on CW and in classroom testing practice, but those
described above are representative of the variants of CW in common use.
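
As a minimal sketch of how such contingency schemes turn an answer record into scores, the Python below encodes two of the variants described above as lookup tables and scores one respondent both ways. The function name, the data layout, and the sample answer record are illustrative assumptions; omissions (such as the .5 credit in Ebel's procedure) and quotas are deliberately left out.

```python
# Score contingencies (points if right, points if wrong) per confidence level,
# following the four-level scheme described for Soderquist and Swineford.
FOUR_LEVEL = {1: (1, 0), 2: (2, -4), 3: (3, -6), 4: (4, -8)}
# Two-level scheme described for Garvin (1969): unweighted vs. "confident".
TWO_LEVEL = {1: (1, 0), 2: (2, -2)}


def score_test(responses, contingencies):
    """Return (raw_score, weighted_score) for one respondent.

    `responses` is a list of (is_right, confidence_level) pairs, one per item.
    The raw score counts rights only; the weighted score applies the bonus
    or penalty attached to the confidence level chosen on each item.
    """
    raw = sum(1 for is_right, _ in responses if is_right)
    weighted = sum(
        contingencies[level][0] if is_right else contingencies[level][1]
        for is_right, level in responses
    )
    return raw, weighted


# Hypothetical five-item answer record: (right?, confidence level chosen).
record = [(True, 2), (True, 1), (False, 1), (False, 2), (True, 2)]
print(score_test(record, TWO_LEVEL))
print(score_test(record, FOUR_LEVEL))
```

The same record earns the same raw score under either scheme, while the weighted score depends on how heavily a confident error is penalized.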

The Effect of CW on Test Reliability. Regardless of the particular
procedure employed, the primary purpose of CW has been to improve test
reliability. In almost every case reported, it has done this, although the
degree of improvement has varied widely from case to case. Moreover, widely
disparate situational factors (test length, format, difficulty, and content,
and respondent motivation) and, most important, disparate experimental
methodologies make it difficult to abstract generalizations from the studies
cited here. Be all that as it may, the consensus of published reports on the
effect of CW on test reliability is that it does "work" to some degree.
Further, the two contrary instances this writer has encountered serve to
confirm his own theory about why it works when it works.

Since improvement in reliability has been the material dependent variable 
in this discussion thus far, it is necessary to provide a suitable metric for 
expressing this variable. Fortunately, two investigators in this field have 
independently arrived at the same metric for this purpose, although they have 
given it different names. Philip DuBois called it a Coefficient of Equivalent 
Length (CEL); Robert Ebel called it an Improvement Factor. Since the former 
is more explicitly descriptive, it will be used here. 

It should be recognized that a test administered under CW procedure
yields two score distributions: a rights-only or raw score distribution
and a weighted score distribution that embodies the score bonuses and
penalties due to CW contingencies. If the reliability coefficients of the raw
scores (r_r) and of the weighted scores (r_w) are computed by any appropriate
common algorithm, these may be compared to provide a measure of reliability
improvement (or decrement) due to CW. The CEL compares these two reliabilities
in a rearrangement of the Spearman-Brown Prophecy formula, viz.:

$$\mathrm{CEL} = \frac{r_w (1 - r_r)}{r_r (1 - r_w)}$$



The CEL is interpreted as the factor by which a conventionally administered
test would have to be lengthened (or shortened) to yield the reliability of the
same test administered under CW procedure. A CEL > 1.0 indicates that CW has
"worked"; a CEL < 1.0 indicates that it has not. In this connection, it must
be remembered that the weighted scores on a test can be less reliable than
the corresponding raw scores.
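
A short Python sketch of this computation follows. It solves the Spearman-Brown relation r_w = k r_r / (1 + (k - 1) r_r) for the lengthening factor k, which is the CEL, and then checks the result by plugging it back in. The function names and the two reliabilities are hypothetical, chosen only for illustration.

```python
def coefficient_of_equivalent_length(r_raw, r_weighted):
    """Spearman-Brown lengthening factor that would raise r_raw to r_weighted."""
    if not (0 < r_raw < 1 and 0 < r_weighted < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (r_weighted * (1 - r_raw)) / (r_raw * (1 - r_weighted))


def spearman_brown(r, k):
    """Predicted reliability of a test lengthened by the factor k."""
    return k * r / (1 + (k - 1) * r)


# Hypothetical raw-score and weighted-score reliabilities.
r_r, r_w = 0.70, 0.80
cel = coefficient_of_equivalent_length(r_r, r_w)
print(round(cel, 2))                       # the lengthening factor
print(round(spearman_brown(r_r, cel), 2))  # lengthening by CEL recovers r_w
```

A CEL above 1.0 in this sketch means the conventional test would have to be that many times longer to match the weighted-score reliability; equal reliabilities give a CEL of exactly 1.0.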

The earliest studies on CW merely reported the r_r and r_w obtained and
let these test statistics "speak for themselves." However, it is possible
to reconstruct a CEL for each of these studies and so compare these with
later studies on a common basis. The CELs attained in the several studies
cited herein are tabled below. The studies are listed chronologically; in
the two multiple-experiment studies, CELs are listed in order of magnitude.

Hevner (1932)               1.72
Soderquist (1936)           2.20
Swineford (1938)            1.48
Dressel and Schmid (1953)   1.16
Ebel (1965)                 1.00  1.07  1.19  1.48  1.72  1.84
Garvin (1969)                .96  1.19  1.38  1.64  1.84

As previously noted, these results must be compared with caution in
view of the disparate situations and methodologies involved. Nevertheless,
the median CEL of 1.48 may be regarded as a reasonable expectation for the
degree of reliability improvement in a typical test situation.

It will be noted that only one of the 15 CELs reported above is less
than 1.0, and that only slightly so. Nevertheless, the possibility exists
that the r_w of a given test would be much lower than its r_r. This raises
both practical and ethical questions as to which set of scores should be
used for various purposes. In anticipation of such a dilemma, this writer
has made it a practice to advise his students that the more reliable of the
two sets of scores would be used in determining grades. In over 80 testing
events he has conducted under CW procedure, the weighted scores have been
the more reliable in all but five cases.

The Effect of CW on Variation of the Standard Error of Measurement. The
discussion thus far has concerned improving the global reliability of a test.
Mollenkopf (1949) has shown that the standard error of measurement is not
likely to be uniform over the range of scores in a distribution unless this
distribution is normal. This is equivalent to saying that a test may be
more reliable at one point in the score distribution than at another. Test
results are generally used to partition testees at one or more points in the
score distribution, e.g., assigning letter grades or selecting a high or low
group for some special purpose. Accordingly, it may be more important to
know how reliable our test is at the point or points where we are going to
make our "cuts" than it is to know its "global" reliability.

The typical teacher-made, objective achievement test yields a negatively
skewed raw score distribution. According to Mollenkopf's formulations, such
a test is relatively more reliable in the extended, lower tail of the score
distribution and relatively less reliable in the blunted, upper tail. If we
are concerned only with the identification of some lowest group, this kind
of test provides its highest reliability where it is most needed. If,
however, a cut must be made somewhere in the upper end of the score
distribution, the effective reliability of the test at this point is
typically less than the global reliability of the test. It is not uncommon
that a test is designed for one purpose and, sooner or later, its results are
used for one or more other purposes. Against this possibility, the most
desirable situation is that its reliability be high and uniform over the full
range of scores. To attain this situation with the typical teacher-made,
objective achievement test, the reliability of the upper end of the score
distribution must be improved without simultaneously depressing the
reliability of the lower end.

Garvin (1969) studied the effect of CW on variation of the error of
measurement (over the range of test scores), to quote the title of his
dissertation. Eight sections of highly motivated, highly intelligent
young men took each of five different tests (in trigonometry, spelling,
and three aspects of English) under CW procedure. In 30 of these 40
section-by-test events, the variation of the error of measurement over
the range of test scores was decreased by CW; when the eight sections were
pooled within tests, this effect was found for every test. Thus, it would
seem that CW does what it does (increase test reliability) where it is
needed most: at the upper end of the score distribution.

The Validity of CW Procedure. Almost as soon as CW was developed, the
construct validity of this procedure was challenged. Indeed, Swineford's
first paper on the subject (1938) was entitled "The Measurement of a
Personality Trait." She contended that CW merely confounds achievement
with an irrelevant personality trait: willingness to take risks (in a
competitive academic setting). Jacobs (1968) substantially replicated
her methodology and came to substantially the same conclusions. The
implications of these conclusions are clear: two hypothetical students of
equal "true" ability, one "confident" and one "diffident," would appear to be
of unequal ability under CW procedure; boldness could eclipse wisdom.

Garvin's (1969) study hypothesized that, under conditions of earnest
academic competition, relevant knowledge, confidence in one's knowledge,
and willingness to manifest such confidence under the contingencies of
extra credit vs. score penalty are all highly and positively correlated.
To test this hypothesis, he defined a subject's weighted score (X_w) minus
his raw score (X_r) as a measure of whatever it is that the CW procedure,
itself, measures, and he defined X_r alone as the a priori measure of
whatever it was that the test, itself, measured. The product-moment
correlation between this gain (or loss) due to CW and the raw scores, X_r,
on the test was taken as a measure of concurrent validity for the CW
procedure. Over the five tests involved in his study, these correlations
ranged from +.49 to +.35 with a mean of +.69. It was concluded that CW
measures more of the same thing that the test itself measures: relevant
knowledge.
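
The correlation described here can be sketched in Python as below: form the CW score component X_w - X_r for each respondent and correlate it with X_r. The six pairs of scores are invented purely to show the mechanics and are not data from any of the studies cited; the function name is likewise an assumption.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical raw (X_r) and weighted (X_w) scores for six respondents.
x_r = [22, 25, 28, 30, 33, 36]
x_w = [20, 27, 29, 36, 41, 47]

# CW score component: the gain (or loss) attributable to confidence weighting.
cw_component = [w - r for w, r in zip(x_w, x_r)]
print(cw_component)
print(round(pearson_r(cw_component, x_r), 2))
```

A strongly positive value in such a computation is what the concurrent-validity argument relies on; a value near zero would instead fit the view that willingness to weight is unrelated to relevant knowledge.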

This rationale has been challenged on the grounds that the score
component due to CW, X_w - X_r, is not independent of the raw score, X_r.
This is quite true. The CW score component for an individual is the resultant
of his willingness to weight a given item and the probability of his being
right when he does weight it, summed over all items. Willingness to weight
and the probability of being right have been found to be highly correlated.
The probability of being right, summed over items, is X_r, and X_r is an a
priori measure of relevant knowledge. Thus, the CW score component is related
to X_r (and, so, to relevant knowledge) through the intervening variable,
willingness to weight. If this were, in fact, a personality trait,
uncorrelated with relevant knowledge, the high positive correlations found
between the CW score component and raw scores would not have occurred. This
expatiation of the writer's rationale for the empirical concurrent validity
of CW does not settle this issue once and for all. It is simply one way to
think about it. In the end, we must be pragmatic and look to the reliability
coefficients involved. If we believe in these coefficients at all and in the
dependence of validity on reliability, we must see some good in any testing
procedure that quite consistently yields a higher reliability than
conventional procedure would.

There is one more factor involved here that deserves consideration. CW,
itself, may be said to have a kind of intrinsic validity. In certain content
areas it can be just as important to know how confident a person is of his
knowledge as it is to know how much knowledge he actually possesses.
Consider, for example, the case of spelling. Imagine that two people spell a
given word correctly on a spelling test. One was confident of his answer
and would have weighted it under CW procedure; the other was not at all sure
of his answer and would not have weighted it. Now, imagine, instead, that
each of these two people was drafting a sentence in which this test word was
appropriate. The first person would probably use the appropriate word; the
second would probably use a less appropriate substitute that he was sure he
could spell (or he would go to a dictionary, if one were available, and look
it up, only to find that his "hunch" was right). It is of little practical
value that he could, in fact, have spelled the original word correctly if he
were forced to try. Imagine, next, that two other people spell this same
word wrong on this test. One was quite unsure of his answer and would not
have weighted it under CW procedure; the other was very sure that his answer
was right and would have weighted it. Now, imagine, instead, that these two
people are each drafting a sentence in which this word was appropriate. The
first of these two people would probably substitute another word that he knew
he could spell (or would consult a dictionary, if available); the second
would probably go right ahead and make a glaring error, and never check it.
Certainly, there is an important practical difference between the states of
knowledge of the two people who both spelled the word correctly on the test
and between the two who both spelled it wrong. A good teacher would do
different things about each of these four people, if he knew that these four
different states of affairs existed. CW provides the teacher with a direct
indication of each of these four states of affairs.
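
The four states of affairs amount to a two-by-two classification of each response by correctness and by whether the respondent chose to weight it. A minimal Python sketch, with labels and student identifiers that are illustrative assumptions rather than terms from the studies cited:

```python
def knowledge_state(is_right, weighted):
    """Classify one item response by correctness and expressed confidence."""
    if is_right and weighted:
        return "right and confident"
    if is_right and not weighted:
        return "right but unsure"
    if not is_right and not weighted:
        return "wrong and unsure"
    return "wrong but confident"


# Hypothetical spelling-test responses of four students on one word:
# (spelled it right?, weighted the response?).
responses = {
    "A": (True, True),
    "B": (True, False),
    "C": (False, False),
    "D": (False, True),
}
for student, (is_right, weighted) in responses.items():
    print(student, knowledge_state(is_right, weighted))
```

Rights-only scoring would report students A and B as identical, and C and D as identical; the second flag is what separates them.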



The importance of knowing the relationship between the state of a man's
knowledge and his confidence therein, and of doing different things about each
combination of these variables, was recognized long ago in an Arabic maxim:

He who knows not, and knows not that
he knows not, is a fool. Shun him.

He who knows not, and knows that he
knows not, is simple. Teach him.

He who knows, and knows not that he
knows, is asleep. Waken him.

He who knows, and knows that he
knows, is wise. Follow him.